ABSTRACT

Application-specific coprocessors, including those for cryptography
and compression, can provide significant acceleration and power
savings to programs requiring their services. While most
coprocessors have traditionally been constructed as a separate chip
connected to the main CPU over a relatively slow bus connection, 3D
integration, providing a more direct connection, is an emerging
technology that offers significant performance advantages and power
savings over such systems. With 3D integration, two or more dies
can be fabricated separately and later combined into a 3D
integrated circuit (3D IC), a single stack of two or more dies
connected by vertical conductive posts.

We propose a novel coprocessor architecture in which one layer
houses application-specific coprocessors for cryptography and
compression, which provide acceleration for applications running on
a general-purpose processor in another layer. A compelling
application for such a system is one that performs real-time trace
collection, compressing the trace prior to its transmission to
permanent off-chip storage for offline program analysis.
Furthermore, an optional encryption step, performed by the
cryptographic circuitry in the coprocessor layer, can protect this
compressed data from interception. In another application, a
high-performance stand-alone encryption service can be provided.
1.  INTRODUCTION

We present a 3D architecture for the real-time transformation
(compression or encryption) of a stream of data. A 3D IC data
transformation processor is useful for collecting execution traces,
e.g., for reverse engineering of malicious software, and for
post-mortem analysis of a system that has suffered an attack.
Because of the reduced wire length made possible by stacking, a 3D
architecture offers latency advantages over traditional
coprocessors that are packaged separately and connected at the
circuit board level or traditional 2D chips that combine a CPU and
a coprocessor on the same die. The CPU layer, or computation plane,
can be sold to ordinary customers without the coprocessor layer, or
control plane, attached, but customers with high trustworthiness or
high performance requirements can purchase the joined unit.
Moreover, the coprocessor layer alone can be manufactured in a
trusted foundry to provide the requisite trustworthiness to the
combined system, forgoing the expense of using a trusted foundry
for the CPU layer. This approach has the potential to improve the
economic feasibility of trustworthy system acquisition.

For each parameter of our proposed design, we justify our choices
based on analysis of real 3D systems and 2D data transformation
processors described in the literature. We also used binary
instrumentation to generate trace files from the computation plane,
which we then compress in order to compare the compression ratios
for a variety of design variables and trace compression algorithms.
Key decision factors for our design include:

•  Semiconductor manufacturing process (e.g., 45nm)

•  The components in the control plane

•  The type of interface between the two dies

•  The method of coordination between the two dies

•  Type of communication interface within the control plane

•  Method of delivery of I/O and power

•  Size, type, and number of computation plane components

•  Method of clock synchronization between planes

2.  BACKGROUND

3D integration is an emerging chip fabrication technique in which
multiple integrated circuit dies are joined using conductive posts.
3D integration offers several performance and security advantages,
including extremely low latency and high bandwidth between the two
dies and the ability to augment a processor with tightly-coupled
custom features. Other advantages include lower power consumption,
the ability to join disparate technologies, and the ability to
control the lineage of a subset of the dies, e.g., by using a
trusted foundry.

Dies can be bonded using face-to-face or face-to-back techniques.
With face-to-face bonding, the metal layers of the two dies face
each other, and die-to-die vias connect the two layers. Only two
dies can be joined in this fashion. With face-to-back bonding, the
bulk silicon substrate (back) of one die is joined with the metal
layers (face) of the other die, and through-silicon vias (TSVs)
connect the metal layers of both dies. TSVs are larger than
face-to-face vias. Face-to-back bonding can join more than two
dies.

Traditional 2D cryptographic coprocessors can be connected to
general-purpose processors at the circuit-board level, or, in a
multi-core system-on-chip (SoC), at the chip level. Some processors
include cryptographic functions in the instruction set architecture
(ISA); however, the ISA is frozen, whereas with 3D, a wide range of
custom cryptographic and compression functions, produced in a
trusted foundry, can be integrated. For some applications, a 2D
implementation is sufficient; however, other applications (e.g.,
Toshiba’s Chip Scale Camera Module [25], see Section 6: Related
Work) may require the high bandwidth and low latency only possible
with a 3D implementation.

3.  DESIGN  GOALS 

Our  proposed  architecture  has  two  major  goals:  (1)  high 
performance,  comparable  to  that  of  other  processors  in  the 
market;  and  (2)  the  ability  for  the  control  plane  to  gather 
and  compress  architectural  state  in  the  computation  plane 
at  runtime  and  send  this  compressed  trace  data  off  chip. 
It  is  impossible  to  track  all  registers  in  a  processor  due  to 
the  massive  volume  of  data  involved;  therefore,  one  must 
carefully  prioritize  what  data  to  monitor. 

Mysore et al. [16] propose a 3D architecture for profiling that
captures many different signals indicative of state and changes to
state, e.g., memory addresses, memory values, program counter,
opcodes, register names, cache misses, etc., in order to be
sufficiently flexible for a variety of analysis techniques. This
set of signals yields an estimate of the number of inter-die vias
for sending the data to an analysis engine, and they estimate that
to collect 1024 bits of profile data each cycle requires 1024
inter-die connections. Important signals to monitor include the
control unit, program counter, status register, instruction
register, and data addresses. We apply the results of their study
to help us estimate the number of die-to-die vias required for our
proposed design.

The  following  section  describes  the  design  choices  necessary 
to  achieve  our  design  goals. 


4.  DESIGN  CHOICES 

This  section  describes  the  key  design  parameters  for  a  3D 
data  transformation  processor  and  a  justification  for  each. 

4.1  Manufacturing  Process 

The range of choices includes face-to-face bonding and face-to-back
bonding. Decision factors include the number of layers required,
level and ease of testing required, and via density. Our choice is
to use face-to-face bonding because it provides testing advantages,
greater via density, and the smallest possible distance between
layers [3]. Also, since our design does not require more than two
layers, face-to-back bonding is unnecessary.

4.2  Compression  Algorithm/Hardware 

Many compression algorithms and hardware implementations are
available. The decision factors include the compression ratio
possible for a given set of trace files, the area cost of the
hardware implementation, and the needed throughput. The optimal
compression algorithm and hardware depend on the type of trace. Our
choice, based on the compression study described in Section 4.6, is
to use two-stage compression. The first stage is filtering, and the
second is general-purpose (gzip).

4.3  Cryptographic  Algorithm/Hardware 

The range of choices of cryptographic algorithm and hardware
implementation includes a wide variety of cryptographic primitives
and hardware implementations. The decision factors include
security, the ability to support a variety of applications
requiring cryptography, area cost, and throughput. Our specific
choice is to include units for AES-128, SHA-1, and SHA-512. These
primitives support a wide variety of applications, e.g.,
networking, and they provide both security and speed.

4.4  Interface  between  Planes 

The range of choices for the interface between the two planes
includes whether to use a direct connection or a bus as well as the
width of the connection (the number of wires corresponding to the
number of vertical connections required). The decision factors
include how well the selected technology supports speed,
simplicity, cost, and density of vias. Our choice is to use 128
vias as a direct connection to send dynamic architectural state
from the computation plane to the control plane, in order to access
(using taps) the program counter (64 bits) and memory address
registers (64 bits), which provide useful data for the dynamic
analysis of the memory behavior of programs. To keep the number of
vias manageable, this design does not support the tapping of all
architectural state. We leave to future work the development of a
general-purpose interface capable of supporting a wider range of
program traces.

We also make the choice to use a 32-via direct connection to send
the encrypted and compressed stream back down to the I/O interface
in the computation plane (compression reduces the number of
required vias to 32). A direct connection is faster and has lower
cost than a bus, and our single producer, single consumer scheme
does not need the contention resolution provided by a bus.


4.5  Other  Issues 

We summarize the other main design considerations: (1) For the
mechanism of coordination between the two planes and for the
configuration/initialization of the control plane, we choose 8-bit
(1-byte) control words, stored in special registers, along with a
single via to signal the control plane when to act upon the control
word. We base this choice on its simplicity and the small number of
face-to-face vias required. (2) For the interface within the
control plane between the compression and cryptographic
coprocessors, the output of the compression circuit is connected to
the input of the cryptographic circuit because compressing
encrypted data does not yield good compression ratios [5]. (3) For
the delivery of I/O and power to/from the outside world, we choose
to employ the existing I/O capability of the computation plane
rather than building a dedicated I/O controller in the control
plane. We base our decision on its simplicity, low cost, and
feasible number of vertical connections; however, we note that
independent I/O and power delivery to/from the control plane would
be useful from a security perspective, and we suggest this as a
topic for future work. (4) For the computation plane hardware, we
select a high-performance general-purpose processor (see footnote
1) available in the marketplace in order to study real application
workloads and realize the economic advantages of dual use of the
computation plane; this requires modifying the CPU to support the
optional attachment of a control plane [23]. (5) For clock
synchronization, we choose the tree network method shown to be
effective in previous research [18] to provide clock signals to the
CPU, compression coprocessor, and encryption coprocessor, using
three clock buffers between the two planes.
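
The ordering in item (2), compression before encryption, can be
illustrated with a small experiment. The following is a minimal
sketch, not part of the proposed hardware: it assumes a hypothetical
synthetic trace of sequential 64-bit program counter values, uses
zlib (DEFLATE) as the compressor, and uses random bytes as a
stand-in for ciphertext rather than running a real cipher.

```python
import os
import struct
import zlib

def synthetic_trace(n):
    """Hypothetical trace: n 64-bit program counter values advancing by 4."""
    return b"".join(struct.pack("<Q", 0x400000 + 4 * i) for i in range(n))

def ratio(data):
    """Compression ratio: original size divided by compressed size."""
    return len(data) / len(zlib.compress(data, 9))

plain = synthetic_trace(4096)
# Stand-in for ciphertext: random bytes of the same length. This is an
# assumption for illustration; no cipher is actually run here.
cipher_like = os.urandom(len(plain))

plain_ratio = ratio(plain)
cipher_ratio = ratio(cipher_like)
print(f"trace: {plain_ratio:.2f}, ciphertext-like: {cipher_ratio:.2f}")
```

The regular trace compresses well, while the ciphertext-like input
does not shrink at all, which is why the compressor must sit
upstream of the cryptographic unit.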

4.6  Compression  Study 

The goal of our compression study was to determine the optimal
compression strategy for a set of real execution traces. We used
TCgen [2], designed by Martin Burtscher specifically for generating
lossless trace compressors from user-generated descriptions, to
compress a set of trace files we generated using Pin [14]. Our
trace files capture the memory access behavior of the Linux
applications Firefox, Gimp, Mozilla, OpenOffice, and Opera, and
they have fields for instruction count, program counter, memory
address, and size. TCgen compresses each field individually rather
than compressing all of the fields together.

For  each  field,  we  varied  the  parameters  of  TCgen,  including 
the  algorithm  and  the  size  of  the  data  structures  internal  to 
each  algorithm.  Algorithms  available  in  TCgen  include  Last 
Value  (LV),  Stride  Predictor  (ST),  Finite-Context-Method 
(FCM),  and  Differential-Finite-Context-Method  (DFCM). 
LV  and  ST  use  one  internal  table,  but  FCM  and  DFCM 
use  two  internal  tables.  Therefore,  for  FCM  and  DFCM,  we 
vary  the  sizes  (number  of  columns)  of  both  tables,  and  for 
LV  and  ST  we  only  vary  one  table. 
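
To make the four predictor families concrete, the following is a
simplified sketch of each one. It is not TCgen's generated code: the
class names, the dictionary-backed tables, the single prediction per
context, and the synthetic loop trace are all assumptions made for
illustration.

```python
class LastValue:
    """LV: predicts that the next value equals the previous one."""
    def __init__(self):
        self.last = 0
    def predict(self):
        return self.last
    def update(self, value):
        self.last = value

class Stride:
    """ST: predicts the previous value plus the previous stride."""
    def __init__(self):
        self.last = 0
        self.stride = 0
    def predict(self):
        return self.last + self.stride
    def update(self, value):
        self.stride = value - self.last
        self.last = value

class FCM:
    """FCM: looks up the value that last followed the recent history."""
    def __init__(self, order):
        self.history = (0,) * order
        self.table = {}
    def predict(self):
        return self.table.get(self.history, 0)
    def update(self, value):
        self.table[self.history] = value
        self.history = self.history[1:] + (value,)

class DFCM:
    """DFCM: an FCM over strides, added to the last value."""
    def __init__(self, order):
        self.last = 0
        self.fcm = FCM(order)
    def predict(self):
        return self.last + self.fcm.predict()
    def update(self, value):
        self.fcm.update(value - self.last)
        self.last = value

def hit_rate(predictor, trace):
    """Fraction of trace values the predictor guessed before seeing them."""
    hits = 0
    for value in trace:
        hits += predictor.predict() == value
        predictor.update(value)
    return hits / len(trace)

# Hypothetical trace: an 8-instruction loop executed 100 times.
trace = [0x1000 + 4 * (i % 8) for i in range(800)]
rates = {
    "LV": hit_rate(LastValue(), trace),
    "ST": hit_rate(Stride(), trace),
    "FCM1": hit_rate(FCM(1), trace),
    "DFCM1": hit_rate(DFCM(1), trace),
}
print(rates)
```

On this toy loop the ranking of the predictors need not match the
measurements reported for the real traces; the sketch only shows how
each predictor forms its guess and how a hit rate is counted.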


1  Dissimilarity between the computation plane and the control
plane presents significant engineering challenges, e.g., if the
control plane uses a different technology node than the computation
plane. It may be necessary to instantiate multiple instances of the
compression hardware to keep up with the extremely fast computation
plane and to carefully design the control plane buffers that
receive data from the computation plane.


             Number of Columns in Table (n)
Algo.        1      2      3      4      5      6      7
dfcm7[n]     7.22   5.42   5.41   5.40   5.98   6.02   5.99
dfcm6[n]     .578   .366   0      0      .004   .002   .002
dfcm5[n]     .956   .522   .522   .522   .526   .526   .526
dfcm4[n]     9.10   .002   0      0      0      0      0
dfcm3[n]     .016   2.08   2.08   2.08   2.12   2.12   2.12
dfcm2[n]     .062   .118   .232   .192   .698   .694   .692
dfcm1[n]     53.6   57.1   57.3   53.7   69.8   69.8   69.8
fcm7[n]      0      0      .188   0      0      0      0
fcm6[n]      0      0      0      0      0      0      0
fcm5[n]      0      0      0      .002   0      0      0
fcm4[n]      0      0      0      0      0      0      0
fcm3[n]      .004   0      0      0      0      0      0
fcm2[n]      .008   .006   .004   .012   .004   .004   .004
fcm1[n]      24.9   32.5   32.6   18.6   19.6   19.6   19.6
st[n]        0      0      0      0      0      0      0
lv[n]        (row illegible in source)
unpred.      3.62   1.88   1.72   19.5   1.25   (last two values illegible in source)

Table 1: Number of correct predictions (%) for each configuration
of TCgen while compressing the program counter field (average of
all five trace files)


Since TCgen is prediction-based compression, the number of correct
predictions indicates the effectiveness of the compression. Table 1
shows the number of correct predictions for each configuration of
TCgen when compressing just the program counter field (average of
all five trace files). The rows of this table correspond to
different compression algorithms, and the columns correspond to the
size of the algorithm’s internal table (n). Algorithms include last
value (LV), stride predictor (ST), finite-context-method (FCM), and
differential-finite-context-method (DFCM).

LV and ST use one table; for algorithms that use two internal
tables (FCM and DFCM), we vary the sizes of both tables. For
example, fcm1[n] indicates the FCM algorithm, where its first table
has one column and its second table has n columns. Unpredictable
("unpred.") corresponds to those symbols in the trace that were
never predicted correctly. For compressing the program counter
field, the configuration with the greatest percentage of correct
predictions (69.8%) uses DFCM with one column in its first internal
table and five columns in its second internal table. We found that
DFCM is also effective for the other fields.

After applying TCgen, we then apply a general-purpose compression
algorithm (gzip) to compress the trace file further. On average,
TCgen (the first stage) improves the compression ratio of gzip (the
second stage) from 46.5 to 58.9 for this set of trace files. The
design implication of our study is that the compression unit should
use a two-stage compression, with the first using TCgen/DFCM and
the second using gzip. Figure 1 shows the data in Table 1
graphically. Figure 2 shows the results when compressing the data
address field. For this field, the differential property of DFCM
does not contribute to the compression since data addresses do not
have a fixed stride. Therefore, we recommend using the FCM
algorithm for data addresses.
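
The benefit of placing a predictive filter in front of gzip can be
sketched in software. The example below is an illustration under
stated assumptions, not the study's setup: a plain stride (delta)
filter stands in for the TCgen/DFCM stage, and the trace is a
hypothetical synthetic one made of runs of sequential 4-byte
instructions starting at pseudo-random addresses.

```python
import gzip
import random
import struct

MASK64 = (1 << 64) - 1

def make_trace(blocks=200, seed=0):
    """Hypothetical PC trace: sequential runs starting at random addresses."""
    rng = random.Random(seed)
    trace = []
    for _ in range(blocks):
        start = rng.randrange(0x400000, 0x500000, 4)
        trace.extend(start + 4 * i for i in range(rng.randrange(8, 64)))
    return trace

def pack(values):
    """Serialize values as little-endian unsigned 64-bit words."""
    return struct.pack(f"<{len(values)}Q", *values)

def delta_filter(trace):
    """First stage (stand-in): replace each value by its stride, mod 2**64."""
    prev, out = 0, []
    for value in trace:
        out.append((value - prev) & MASK64)
        prev = value
    return out

def delta_unfilter(deltas):
    """Inverse of delta_filter, so the first stage stays lossless."""
    prev, out = 0, []
    for delta in deltas:
        prev = (prev + delta) & MASK64
        out.append(prev)
    return out

trace = make_trace()
one_stage = gzip.compress(pack(trace))                 # gzip only
two_stage = gzip.compress(pack(delta_filter(trace)))   # filter, then gzip
print(len(pack(trace)), len(one_stage), len(two_stage))
```

Because the filter turns every sequential run into a repetition of
the same stride, the second stage sees far more redundancy than it
does in the raw addresses.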

Figure 1: Number of correct predictions (%) for each configuration
of TCgen while compressing the program counter field (average of
all five trace files)

Figure 2: Number of correct predictions (%) for each configuration
of TCgen while compressing the data address field (average of all
five trace files)

Figure 3: Compression ratio for the program counter field. Our
two-stage proposal (DFCM + GZIP) has a slight advantage over a
single GZIP stage.

Figure 4: The low percentage of good predictions in the data
address field results in a poor compression ratio for our proposed
design. The first stage’s algorithm must be carefully chosen in
order to achieve a better compression ratio.


5.  SYSTEM  ARCHITECTURE 

In  this  section,  we  apply  the  results  of  our  design  choices 
and  our  compression  study  to  build  a  3D  transformation 
engine.  We  use  the  following  circuit-level  primitives  [23]  to 
allow  the  control  plane  to  interact  with  the  computation 
plane:  disabling,  tapping,  rerouting,  and  overriding.  This 
architecture  requires  128  direct  links  (128  bits)  between  the 
computation  and  control  planes  to  access  (using  taps)  the 
program  counter  (64  bits)  and  memory  address  registers  (64 
bits).  These  direct  links  lead  from  various  locations  in  the 
computation  plane  to  the  compression  circuit  in  the  control 
plane. 

Figure 5 depicts a block diagram of the computation plane, showing
the control unit of the microprocessor, program counter, memory
address register, cache, clock (for synchronizing the two planes),
I/O interface, and I/O controller (to send the compressed stream
off-chip). The compressed and encrypted stream is sent back to the
computation plane along another set of face-to-face vias to
eliminate the need for a separate I/O capability of the control
plane. As the computation plane may be considered an unsafe
environment, we provide the option of encrypting the trace file
before it is exported.

The main components of the control plane are the microprocessor
interface (Figure 10), the compression coprocessor (Figure 7), and
the cryptographic coprocessor (Figure 6). The control plane also
uses buffers to ensure that compression proceeds smoothly without
stalling the processor or dropping data [15]. In addition to
transferring data between layers, clock signals and query/control
signals must also be transmitted. The microprocessor interface
(Figure 10) in the control plane manages the query and control
signals, which include: a clock signal, a read/write signal, an
address/data byte, and externally accessible registers to
send/receive the signals. The registers include error, status,
interrupt, command, and reset. Figure 9 shows the integration of
the computation plane, microprocessor interface, compression unit,
and cryptographic unit into a full system.

The read/write signal uses one or zero to indicate a read or write.
The address/data signals use one byte. The two most significant
bits address the interface register, the next bit specifies whether
the signal is for the compression or cryptographic hardware, and
the last five bits are the instruction, supporting 32 query/control
instructions for each coprocessor. For synchronization of the CPU
in the computation plane and the two coprocessors in the control
plane, we use a three-level buffer clock distribution network,
which helps reduce transmission time [6, 18]. The compression
coprocessor uses Content Addressable Memories (CAM), which allow
multiple comparisons to be made in parallel [17].
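
The layout of the address/data byte can be sketched as a pair of
encode/decode helpers. This is a minimal illustration: the function
names are hypothetical, and the polarity of the select bit (0 for
compression, 1 for cryptography) is an assumption, since the text
specifies only the field widths and their order.

```python
COMPRESSION, CRYPTO = 0, 1  # assumed meaning of the select bit

def encode(register, unit, instruction):
    """Pack the three fields into one byte: rr u iiiii (MSB first)."""
    if not (0 <= register < 4 and unit in (0, 1) and 0 <= instruction < 32):
        raise ValueError("field out of range")
    return (register << 6) | (unit << 5) | instruction

def decode(word):
    """Split a control byte into (register, unit, instruction)."""
    return (word >> 6) & 0x3, (word >> 5) & 0x1, word & 0x1F

word = encode(2, CRYPTO, 5)
print(hex(word), decode(word))
```

Five instruction bits give the 32 query/control instructions per
coprocessor mentioned above, and two register bits address at most
four interface registers.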

The  compression  processor  uses  two-stage  compression:  the 
first  uses  the  Differential  Finite  Context  Method  (DFCM)  of 
TCgen  [2];  the  second  stage  uses  gzip.  Figure  8  shows  a  block 
diagram  of  the  gzip  module.  A  FIFO  buffer  is  used  to  avoid 
stalling  the  processor  and  to  smooth  out  speed  variations 
due  to  warm  up  time.  The  64-bit  output  is  sliced  into  32 
bits  prior  to  being  sent  to  the  encryption  unit,  i.e.,  each 
64-bit  value  is  split  into  two  32-bit  values. 
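
The slicing step can be written out as follows. This is a sketch
only; sending the upper half first is an assumption, since the text
does not state the order in which the two 32-bit values are
forwarded.

```python
def slice64(value):
    """Split a 64-bit value into (high, low) 32-bit halves."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF

def join32(high, low):
    """Reassemble the 64-bit value from its two halves."""
    return (high << 32) | low

high, low = slice64(0x0123456789ABCDEF)
print(hex(high), hex(low))
```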


The encryption unit is inspired by the HSSec cryptographic
coprocessor [10]. It supports AES-128, SHA-1, and SHA-512, selected
for their security, speed, low power, and their ability to support
a variety of applications. The control unit manages data processing
and communication with the compression processor and microprocessor
interface. The AES-128, SHA-1, and SHA-512 units use a common
64-bit global data bus. The key scheduler is used for key expansion
and generating message schedules. The memory block consists of a
register file, padding unit, and S-boxes. The mode interface is
responsible for modifying the input to the cryptographic
primitives. The key scheduler performs the RotWord and SubWord
transformations and provides constants needed by the hash
functions: SHA-1 uses a sequence of 80 32-bit words, and SHA-512
uses a sequence of 80 64-bit words.
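
As a software illustration of the 80-word sequence used by SHA-1,
the sketch below expands one 64-byte block into the 80 32-bit
message schedule words defined by the SHA-1 standard and uses them
in a reference implementation. It models the algorithm only, not the
proposed hardware; the function names are hypothetical.

```python
import struct

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def message_schedule(block):
    """Expand one 64-byte block into the 80 32-bit words W[0..79]."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

def sha1(message):
    """Software reference SHA-1 built on message_schedule."""
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    # Padding: 0x80, zeros up to 56 mod 64, then the 64-bit bit length.
    padded = message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
    padded += struct.pack(">Q", 8 * len(message))
    for i in range(0, len(padded), 64):
        w = message_schedule(padded[i:i + 64])
        a, b, c, d, e = h
        for t in range(80):
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            temp = (rotl32(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF
            a, b, c, d, e = temp, a, rotl32(b, 30), c, d
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]
    return struct.pack(">5I", *h)

print(sha1(b"abc").hex())
```

SHA-512 follows the same pattern with 64-bit words and a different
expansion function, which is why a shared scheduler must handle both
word widths.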

6.  RELATED  WORK 

Vasudevan et al. have developed the XTRec primitive for recording
the instruction-level execution trace of a commodity computing
system while simultaneously ensuring the integrity of the recorded
information on commodity platforms without requiring software
modifications or specialized hardware [24]. Such a primitive can be
used to perform post-mortem analysis for forensic purposes. Our
work differs from XTRec in that we are proposing a specialized 3D
IC approach, and we argue that our proposed system would facilitate
the capture of additional activity besides the instruction trace,
at higher bandwidth, in exchange for the higher cost of specialized
hardware.

Many 3D applications have been built successfully, including 3D
chips for imaging [25], medicine [9], particle physics [4],
reconfigurable hardware [20], and high-performance microprocessors
[1], [19], [13], [12], [11]. Previous work on security applications
of 3D integration includes a 3D design for mitigating access-driven
cache side channel attacks (and the circuit-level primitives needed
to support such designs) [23]; a study of whether individual layers
must be independently trustworthy for the system of joined dies to
provide certain trustworthy functions [7]; additional primitives
and design flow modifications to support security in 3D designs
[8]; and a qualitative security analysis of a new class of 3D
crypto coprocessors [22]. This paper builds on [22] by presenting a
specific instance of a data transformation processor that combines
cryptography and compression.

7.  CONCLUSION 

We have presented an architecture for a 3D data transformation
processor and a rationale for each of the key design decisions,
including a compression study that determined the optimal
compression algorithm for a specific set of traces generated using
the Pin dynamic binary instrumentation tool. We leave to future
work the hardware implementation, simulation, FPGA prototype, and
3D IC realization of the design in silicon.

Figure 5: Block diagram of computation plane

Figure 8: Block diagram of gzip unit, after [21]